Adaptive Naïve Bayesian Anti-Spam Engine

نویسنده

  • Wojciech P. Gajewski
چکیده

The problem of spam has been seriously troubling the Internet community during the last few years and currently reached an alarming scale. Observations made at CERN (European Organization for Nuclear Research located in Geneva, Switzerland) show that spam mails can constitute up to 75% of daily SMTP traffic. A naïve Bayesian classifier based on a Bag Of Words representation of an email is widely used to stop this unwanted flood as it combines good performance with simplicity of the training and classification processes. However, facing the constantly changing patterns of spam, it is necessary to assure online adaptability of the classifier. This work proposes combining such a classifier with another NBC (naïve Bayesian classifier) based on pairs of adjacent words. Only the latter will be retrained with examples of spam reported by users. Tests are performed on considerable sets of mails both from public spam archives and CERN mailboxes. They suggest that this architecture can increase spam recall without affecting the classifier precision as it happens when only the NBC based on single words is retrained. Keywords—Text classification, naïve Bayesian classification, spam, email.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Web Spam Detection Using Machine Learning in Specific Domain Features

In the last few years, as Internet usage becomes the main artery of the life's daily activities, the problem of spam becomes very serious for internet community. Spam pages form a real threat for all types of users. This threat proved to evolve continuously without any clue to abate. Different forms of spam witnessed a dramatic increase in both size and negative impact. A large amount of E-mail...

متن کامل

A Neural Network Classifier for Junk E-Mail

Most e-mail readers spend a non-trivial amount of time regularly deleting junk e-mail (spam) messages, even as an expanding volume of such e-mail occupies server storage space and consumes network bandwidth. An ongoing challenge, therefore, rests within the development and refinement of automatic classifiers that can distinguish legitimate e-mail from spam. A few published studies have examined...

متن کامل

SMS Spam Filtering Technique Based on Artificial Immune System

The Short Message Service (SMS) have an important economic impact for end users and service providers. Spam is a serious universal problem that causes problems for almost all users. Several studies have been presented, including implementations of spam filters that prevent spam from reaching their destination. Naïve Bayesian algorithm is one of the most effective approaches used in filtering te...

متن کامل

Survey of Spam Filtering Techniques and Tools, and MapReduce with SVM

Abstract Spam is unsolicited, junk email with variety of shapes and forms. To filter spam, various techniques are used. Techniques like Naïve Bayesian Classifier, Support Vector Machine (SVM) etc. are often used. Also, a number of tools for spam filtering either paid or free are available. Amongst all techniques SVM is mostly used. SVM is computationally intensive and for training it can’t work...

متن کامل

AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER by

An efficient anti-spam filter that would block all unsolicited messages i.e. spam, without blocking any legitimate messages is a growing need. To address this problem, this report takes a statistically-based approach, employing a Bayesian anti-spam filter, because it is content-based and self-learning (adaptive) in nature. We train the filter, using a large corpus of legitimate messages and spa...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005